Abstract

We consider the maximum likelihood (Viterbi) alignment of a hidden Markov model 
(HMM). In an HMM, the underlying Markov chain is usually hidden and the Viterbi 
alignment is often used as the estimate of it. This approach will be referred to as the 
Viterbi segmentation. The goodness of the Viterbi segmentation can be measured 
by several risks. In this paper, we prove the existence of asymptotic risks. Being 
independent of data, the asymptotic risks can be considered as the characteristics 
of the model that illustrate the long-run behavior of the Viterbi segmentation. 

Keywords: hidden Markov model, Viterbi alignment, segmentation. 

1 Introduction 

The present paper deals with the asymptotics of the Viterbi segmentation. Before we
can present the main results, we introduce the segmentation problem and the different
risks for measuring the goodness of segmentations.



1.1 Notation 

Let $Y = \{Y_t\}_{t=-\infty}^{\infty}$ be a double-sided stationary MC with states $S = \{1,\ldots,|S|\}$ and
irreducible aperiodic transition matrix $P = (P(i,j))$. Let $X = \{X_t\}_{t=-\infty}^{\infty}$ be a
double-sided process such that: 1) given $\{Y_t\}$ the random variables $\{X_t\}$ are conditionally
independent; 2) the distribution of $X_j$ depends on $\{Y_t\}$ only through $Y_j$. The process
$X$ is sometimes called a hidden Markov process (HMP) and the pair $(Y, X)$ is referred
to as a hidden Markov model (HMM). The name is motivated by the assumption
that the process $Y$, which is sometimes called the regime, is non-observable. The
distributions $P_s := \mathbf{P}(X_1 \in \cdot \mid Y_1 = s)$ are called emission distributions. We shall
assume that the emission distributions are defined on a measurable space $(\mathcal{X}, \mathcal{B})$,
where $\mathcal{X}$ is usually $\mathbb{R}^d$ and $\mathcal{B}$ is the Borel $\sigma$-algebra. Without loss of generality we
shall assume that the measures $P_s$ have densities $f_s$ with respect to some reference
measure $\mu$. Our notation differs from the one used in the HMM literature, where
usually $X$ stands for the regime and $Y$ for the observations. Since our study is
mainly motivated by statistical learning, we would like to be consistent with the
notation used there and keep $X$ for observations and $Y$ for latent variables.
HMMs are widely used in various fields of applications, including speech recognition
[21, 9], bioinformatics [14, 6], language processing [20], image analysis [19] and many
others. For a general overview of HMMs, we refer to [3] and [7].

Given a set $A$ and integers $m$ and $n$, $m < n$, we shall denote any $(n - m + 1)$-dimensional
vector with all the components in $A$ by $a_m^n := (a_m, \ldots, a_n)$. When
$m = 1$, it will often be dropped from the notation and we write $a^n \in A^n$.



1.2 Segmentation and risks 

The segmentation problem consists of estimating the unobserved realization of the
underlying Markov chain $Y_1, \ldots, Y_n$ given $n$ observations $x^n = (x_1, \ldots, x_n)$ from a
hidden Markov model. Formally, we are looking for a mapping $g : \mathcal{X}^n \to S^n$, called a
classifier, that maps every sequence of observations into a state sequence (see [12] for
details). To find the best $g$, it is natural to assign to every state sequence $s^n \in S^n$
a measure of goodness of $s^n$, referred to as the risk of $s^n$. Let us
denote the risk of $s^n$ for a given $x^n$ by $R(s^n|x^n)$. The solution of the segmentation
problem is then a state sequence with minimum risk. In the framework of pattern
recognition theory the risk is specified via a loss function $L : S^n \times S^n \to [0, \infty]$, where
$L(a^n, b^n)$ measures the loss when the actual state sequence is $a^n$ and the estimated
sequence is $b^n$. For any state sequence $s^n \in S^n$ the risk is then

$$R(s^n|x^n) := E[L(Y^n, s^n) \mid X^n = x^n] = \sum_{a^n \in S^n} L(a^n, s^n)\,\mathbf{P}(Y^n = a^n \mid X^n = x^n).$$

One common loss function is the so-called symmetric loss $L_\infty$ defined as

$$L_\infty(a^n, b^n) := \begin{cases} 1, & \text{if } a^n \ne b^n; \\ 0, & \text{if } a^n = b^n. \end{cases}$$

We shall denote the corresponding risk by $R_\infty$. With this loss,
$R_\infty(s^n|x^n) = \mathbf{P}(Y^n \ne s^n \mid X^n = x^n)$, thus the minimizer of $R_\infty(\cdot|x^n)$ is a sequence
with maximum posterior probability, called the Viterbi alignment. The name is inherited
from the dynamic programming algorithm (Viterbi algorithm) used for finding it. Let $v$
stand for the Viterbi alignment, i.e. $v(x^n) = \arg\max_{s^n} p(s^n|x^n)$, where
$p(s^n|x^n) = \mathbf{P}(Y^n = s^n \mid X^n = x^n)$. Obviously, the Viterbi alignment is not necessarily
unique. The Viterbi alignment also minimizes the following risk:

$$\bar R_\infty(s^n|x^n) := -\frac{1}{n}\ln p(s^n|x^n). \tag{1.1}$$

The log-likelihood based risk (1.1) is often preferable since it allows various
generalizations, see (1.4). Another common classifier is based on the pointwise loss
function

$$L_1(a^n, b^n) = \frac{1}{n}\sum_{t=1}^{n} l(a_t, b_t), \tag{1.2}$$

where $l(a_t, b_t) \ge 0$ is the loss of classifying the $t$-th symbol $a_t$ as $b_t$. Typically, for
every state $s$, $l(s,s) = 0$. Let us denote the corresponding risk by $R_1(s^n|x^n)$:

$$R_1(s^n|x^n) = \frac{1}{n}\sum_{t=1}^{n} R_1^t(s_t|x^n),$$

where $R_1^t(s|x^n) := \sum_{a \in S} l(a, s)\,p_t(a|x^n)$ and $p_t(a|x^n) := \mathbf{P}(Y_t = a \mid X^n = x^n)$. Most
frequently $l(s,s') = I_{\{s \ne s'\}}$, and then $R_1(s^n|x^n)$ just counts the expected number
of misclassified symbols given that the data are $x^n$ and the sequence $s^n$ is used for
segmentation. In that case

$$R_1(s^n|x^n) = 1 - \frac{1}{n}\sum_{t=1}^{n} p_t(s_t|x^n). \tag{1.3}$$

The minimizer of (1.3) over all the possible state sequences is called the pointwise
maximum a posteriori (PMAP) alignment. The Viterbi and the PMAP-classifier -
the so-called standard classifiers - are by far the two most popular classifiers used
in practice.
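The two standard classifiers can be sketched in a few lines. The following is a minimal illustration, not the implementation of any cited work: the function names `viterbi`, `smoothing_probs` and `pmap` are hypothetical, the emission densities are passed as a precomputed table `emis[t][s]` standing for $f_s(x_t)$, all probabilities are assumed strictly positive, and ties are broken toward the smallest state index.

```python
import math

def viterbi(init, trans, emis):
    """Return one maximizer of p(s^n | x^n); ties go to the smallest state index.

    init[s]     : initial probability of state s
    trans[i][j] : transition probability P(i, j), assumed > 0 in this sketch
    emis[t][s]  : emission density f_s(x_t), assumed > 0 in this sketch
    """
    n, k = len(emis), len(init)
    delta = [math.log(init[s]) + math.log(emis[0][s]) for s in range(k)]
    back = []
    for t in range(1, n):
        prev, delta, ptr = delta, [], []
        for j in range(k):
            best = max(range(k), key=lambda i: prev[i] + math.log(trans[i][j]))
            ptr.append(best)
            delta.append(prev[best] + math.log(trans[best][j]) + math.log(emis[t][j]))
        back.append(ptr)
    path = [max(range(k), key=lambda s: delta[s])]
    for ptr in reversed(back):          # follow the back-pointers from the end
        path.append(ptr[path[-1]])
    return path[::-1]

def smoothing_probs(init, trans, emis):
    """Return p_t(a | x^n) for all t and a, using scaled forward-backward recursions."""
    n, k = len(emis), len(init)
    alpha = []
    for t in range(n):
        if t == 0:
            a = [init[s] * emis[0][s] for s in range(k)]
        else:
            a = [sum(alpha[-1][i] * trans[i][j] for i in range(k)) * emis[t][j]
                 for j in range(k)]
        z = sum(a)
        alpha.append([v / z for v in a])
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        b = [sum(trans[i][j] * emis[t + 1][j] * beta[t + 1][j] for j in range(k))
             for i in range(k)]
        z = sum(b)
        beta[t] = [v / z for v in b]
    post = []
    for t in range(n):
        g = [alpha[t][s] * beta[t][s] for s in range(k)]
        z = sum(g)
        post.append([v / z for v in g])
    return post

def pmap(init, trans, emis):
    """Pointwise maximum a posteriori alignment: argmax of p_t(. | x^n) for each t."""
    return [max(range(len(init)), key=lambda s: row[s])
            for row in smoothing_probs(init, trans, emis)]

# Toy example: two states, six observations summarized by their densities.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emis = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
print(viterbi(init, trans, emis))
print(pmap(init, trans, emis))
```

In this toy example both alignments coincide; in general they may differ, and the PMAP-alignment can even have zero posterior probability as a whole sequence.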

We shall also consider the risk

$$\bar R_1(s^n|x^n) := -\frac{1}{n}\sum_{t=1}^{n} \ln p_t(s_t|x^n).$$

The risks $R_1$ and $\bar R_1$ are closely related. Minimizing (1.3) over all possible state
sequences is clearly equivalent to minimizing $\bar R_1$, but this is not necessarily so for
restricted minimization. The importance of $\bar R_1$ and $\bar R_\infty$ becomes apparent in [12],
where the following penalized $\bar R_1$-risk is considered:

$$\bar R_C(s^n|x^n) := \bar R_1(s^n|x^n) + C\bar R_\infty(s^n|x^n). \tag{1.4}$$

Here $C \ge 0$ is a given regularization constant. The risk $\bar R_C$ naturally interpolates
between the two standard alignments: for $C = 0$ the minimizer of (1.4) is the
PMAP-alignment, and it is not hard to see that for $C$ big enough the minimizer of
(1.4) is the Viterbi alignment. Obviously, the likelihood of the minimizer of (1.4)
increases with $C$, as does the $\bar R_1$-risk. In [12] it is shown that minimizing the
risk $\bar R_C$ for an integer $C$ is closely related to maximizing the expected number of
correctly estimated tuples of $C + 1$ adjacent states. In [12] it is also shown that
minimization of $\bar R_C(s^n|x^n)$ as well as of $R_1(s^n|x^n) + C\bar R_\infty(s^n|x^n)$ can be carried out
by a dynamic programming algorithm that is similar to the Viterbi algorithm and
easy to implement.
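One way such a recursion might look is sketched below. This is an illustration under stated assumptions rather than the algorithm of [12]: since $\ln p(s^n|x^n)$ and $\ln p(s^n, x^n)$ differ only by a term that does not depend on $s^n$, minimizing (1.4) amounts to maximizing $\sum_t \ln p_t(s_t|x^n) + C \ln p(s^n, x^n)$, which decomposes over adjacent pairs of states. The function name `penalized_alignment` is hypothetical, the smoothing probabilities `post[t][s]` are assumed to be computed beforehand, and all probabilities are assumed strictly positive.

```python
import math

def penalized_alignment(init, trans, emis, post, C):
    """Maximize sum_t ln post[t][s_t] + C * ln p(s^n, x^n) by a Viterbi-type recursion.

    post[t][s] stands for the smoothing probability p_t(s | x^n) (precomputed);
    emis[t][s] stands for the emission density f_s(x_t).
    """
    n, k = len(emis), len(init)

    def node(t, j):
        # Score that depends on the state at time t only.
        return math.log(post[t][j]) + C * math.log(emis[t][j])

    delta = [C * math.log(init[j]) + node(0, j) for j in range(k)]
    back = []
    for t in range(1, n):
        prev, delta, ptr = delta, [], []
        for j in range(k):
            best = max(range(k), key=lambda i: prev[i] + C * math.log(trans[i][j]))
            ptr.append(best)
            delta.append(prev[best] + C * math.log(trans[best][j]) + node(t, j))
        back.append(ptr)
    path = [max(range(k), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With `C = 0` the recursion returns the pointwise maximizers of `post`, and for large `C` the transition and emission terms dominate, in line with the interpolation property described above.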
1.3 Organization of the paper and main results

The quantity $R(g, x^n) := R(g(x^n)|x^n)$ measures the goodness of a classifier $g$ when
it is applied to the observations $x^n$. When $g$ is optimal in the sense of risk, then
$R(g, x^n) = \min_{s^n} R(s^n|x^n) =: R(x^n)$. We are interested in the random variables
$R(g, X^n)$. The present paper deals mostly with convergence of the risks of Viterbi
alignments. The results are all largely based on the regenerativity of the Viterbi
process $\{V_t\}_{t=1}^{\infty}$, which is an $S$-valued stochastic process that is in a sense the limit
of the random vectors $v(X^n)$ as $n$ grows. The existence of the Viterbi process is
crucial and not obvious; our analysis is based on the results in [18, 17, 13], where
the Viterbi process is constructed piecewise.

In this paper we shall show that under fairly general assumptions on an HMM, the
random variables $R_1(v, X^n)$, $\bar R_1(v, X^n)$ as well as $\bar R_\infty(X^n) := \bar R_\infty(v, X^n)$ all
converge to constant limits almost surely. These convergences are stated in Theorems
3.1, 4.1 and 5.1, which are the main results of the paper. The limits - asymptotic
risks - are constants that all depend on the model and characterize the goodness of
the segmentation based on the Viterbi alignment. If, for example, $R_1$ is the limit of
$R_1(v, X^n)$ and $R_1^*$ is the limit of $R_1(X^n)$, then the difference $R_1 - R_1^*$ shows how
well the Viterbi alignment performs the segmentation in the long run in the sense
of $R_1$-risk in comparison to the best possible alignment. If $R_1$-risk is defined as
in (1.3), then for $n$ big enough the Viterbi alignment makes approximately $nR_1$
classification errors, while the best alignment in this case - the PMAP-alignment -
makes approximately $nR_1^*$ errors. Since the model is known, the asymptotic risks
could in principle be found theoretically, but the convergence theorems show that
they could also be found by simulations.
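A simulation of this kind can be sketched as follows. The model below is purely an assumption made for illustration (a symmetric two-state chain with unit-variance Gaussian emissions), and the function name `simulate_viterbi_error_rate` is hypothetical; the returned value is the empirical misclassification rate of the Viterbi alignment on one simulated path, which for large $n$ should be close to the asymptotic risk of that model.

```python
import math
import random

def simulate_viterbi_error_rate(n, seed=0, stay=0.9, means=(-1.0, 1.0)):
    """Simulate a two-state HMM of length n and return the fraction of
    positions where the Viterbi alignment differs from the true regime."""
    rng = random.Random(seed)
    # Simulate the regime Y (stationary start) and the observations X.
    y = [rng.randrange(2)]
    for _ in range(n - 1):
        y.append(y[-1] if rng.random() < stay else 1 - y[-1])
    x = [rng.gauss(means[s], 1.0) for s in y]

    log_stay, log_move = math.log(stay), math.log(1.0 - stay)

    def log_f(s, xt):
        # Gaussian log-density up to an additive constant common to both states.
        return -0.5 * (xt - means[s]) ** 2

    # Viterbi recursion in the log domain; ties are broken toward state 0.
    delta = [math.log(0.5) + log_f(s, x[0]) for s in range(2)]
    back = []
    for t in range(1, n):
        prev, delta, ptr = delta, [], []
        for j in range(2):
            cand = [prev[i] + (log_stay if i == j else log_move) for i in range(2)]
            best = 0 if cand[0] >= cand[1] else 1
            ptr.append(best)
            delta.append(cand[best] + log_f(j, x[t]))
        back.append(ptr)
    path = [0 if delta[0] >= delta[1] else 1]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return sum(a != b for a, b in zip(y, path)) / n

print(simulate_viterbi_error_rate(20000, seed=1))
```

Repeating the experiment with increasing `n` or different seeds gives a rough picture of how quickly the empirical risk stabilizes.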

The results concerning the construction of the Viterbi process are introduced in
Subsection 2.2. The piecewise construction under general assumptions is rather technical
(see [18, 13]). However, when it is performed, the regenerativity of the Viterbi
process as well as the ergodicity of the double-sided Viterbi process easily follow. The
references to necessary results from the theory of regenerative processes are given in
Subsection 2.1.

Section 3 deals with the convergence of the $R_1$-risk. We prove that $R_1(v, X^n)$
converges to a constant $R_1$ almost surely. Section 4 proves the convergence of the
$\bar R_1$-risk for the Viterbi and PMAP-alignment. Since the regenerativity of the
PMAP-process, which is the analogue of the Viterbi process for the PMAP-alignment, is not
proved, the regenerativity-based methods cannot be used for the long-run analysis
of PMAP-alignments. However, as shown in [15], the convergence of the $R_1$-risk of
the PMAP-alignment can be proved with a completely different method based on
the exponential forgetting of smoothing probabilities. The exponential forgetting
inequalities are introduced in Subsection 2.3 and in Section 4 we show that they
imply also the convergence of the $\bar R_1$-risk of the PMAP-alignment. In Section 5 the
convergence of the log-likelihood or $\bar R_\infty$-risk is proved.

There is no universal method known yet to prove the convergence of general risks
and every optimal alignment needs a special treatment. For example, the convergence
of $\bar R_C(X^n) = \min_{s^n} \bar R_C(s^n|X^n)$ (as well as of several other more general risks
introduced in [12]) has not yet been proved, although it is reasonable to conjecture
that it holds. Moreover, we conjecture that the dynamic programming algorithm
for finding the minimizer of $\bar R_C$-risk together with the exponential smoothing could
be used to find the $\bar R_C$-optimal alignment process piecewise. If this is true, then the
alignment process is regenerative and the results and methods in the present paper
can be applied to many other optimal alignments.

2 Preliminary results 
2.1 Regenerativity 

We are following the coupling approach developed by Thorisson in [22]. One of
the main instruments we are going to use is that any regenerative process can be
successfully coupled with a stationary and ergodic regenerative process (Theorem
2.1). With a successful coupling, a general pathwise limit theorem for the Viterbi
alignment (Theorem 2.3) can be proved. This is the main preliminary result and it
can be used for many other purposes besides proving the convergence of risks.
Let $Z = \{Z_t\}_{t=1}^{\infty}$ in $(\Omega, \mathcal{F}, \mathbf{P})$ be a $\mathcal{Z} := \mathbb{R}^d$-valued classical regenerative process
with respect to the renewal process $S = \{S_t\}_{t=0}^{\infty}$ (see, e.g. Chapter 10 in [22]).
Following the notation in [22], we shall denote the regenerative process by $(Z, S)$.
Let $T_1 := S_1 - S_0$. The regenerative process $(Z, S)$ is positive recurrent if $ET_1 < \infty$
and aperiodic if $T_1$ is aperiodic, i.e. $\mathbf{P}(T_1 \in a\mathbb{N}) < 1$ for every $a > 1$. A pair $(Z', S')$ is
a version of the regenerative process $(Z, S)$ if it is also regenerative and
$\theta_{S_0}(Z, S) \stackrel{D}{=} \theta_{S'_0}(Z', S')$, where $\theta_t$ is a shift operator:
$\theta_t(x_1, x_2, \ldots) = (x_{t+1}, x_{t+2}, \ldots)$, and $\stackrel{D}{=}$
means equal in law. The version $(Z^0, S^0) := \theta_{S_0}(Z, S)$ of $(Z, S)$ is a zero-delayed
regenerative process. Thus, $S_1^0 = T_1$. Recall that $(Z, S)$ is stationary if $\theta_t(Z, S)$
has the same distribution as $(Z, S)$. If $(Z, S)$ is positive recurrent regenerative, then
there exists a stationary version $(Z^*, S^*)$ of this process such that the distribution
of the delay length $S_0^*$ is given by

$$\mathbf{P}(S_0^* = k) = \frac{1}{ET_1}\mathbf{P}(T_1 > k), \quad k \ge 0,$$

and for every $\sigma(\mathcal{Z}^{\infty})$-measurable function $g : \mathcal{Z}^{\infty} \to \mathbb{R}$ the following equality
holds:

$$Eg(Z_1^*, Z_2^*, \ldots) = \frac{1}{ET_1} E\Big[\sum_{t=0}^{T_1 - 1} g(\theta_t(Z^0))\Big], \tag{2.1}$$

see, e.g. Theorem 2.1 and 2.2 of Chapter 10 in [22] or Theorem 6.1 in [10].

Recall that a sub-$\sigma$-algebra of $\mathcal{F}$ is called trivial if its elements have probability
1 or 0. In the following we consider two $\sigma$-algebras: the tail-$\sigma$-algebra
$\mathcal{T} := \bigcap_{t=1}^{\infty} \theta_t^{-1}(\sigma(\mathcal{Z}^{\infty}))$ and the $\sigma$-algebra of shift-invariant sets
$\mathcal{I} := \{A \in \sigma(\mathcal{Z}^{\infty}) : \theta_1^{-1} A = A\}$. A stationary $\mathcal{I}$-trivial process is ergodic.
Since $\mathcal{I} \subset \mathcal{T}$ (see Section 5.1 in [22]), a stationary $\mathcal{T}$-trivial process (sometimes also
called regular) is also ergodic. The following version of Theorem 3.3 of Chapter 10 in [22]
states that an aperiodic positive recurrent regenerative process can be successfully
coupled with a stationary ergodic process.
Theorem 2.1 Let $(Z, S)$ be an aperiodic and positive recurrent regenerative process.
Let $(Z^*, S^*)$ be a stationary version of it. Then the following statements hold:

a) The space $(\Omega, \mathcal{F}, \mathbf{P})$ can be extended to support a finite random time $T$ and a
copy $Z'$ of $Z^*$ such that $(Z, Z', T)$ is a successful exact coupling of $Z$ and $Z^*$,
i.e.

$$\theta_T Z = \theta_T Z', \quad \text{where } Z' \stackrel{D}{=} Z^*.$$

b) The processes $Z$ and $Z'$ are $\mathcal{T}$-trivial.

Proof. The process $Z$ is aperiodic, which means that $T_1$ is a lattice with span 1.
Since $(Z, S)$ and $(Z^*, S^*)$ are discrete, the random variables $S_0$ and $S_0^*$ are $\mathbb{Z}$-valued.
So the assumptions of Theorem 3.3 of Chapter 10 in [22] are fulfilled. The claim a) is
claim a) of that theorem, the $\mathcal{T}$-triviality of $Z$ is claim d) of that theorem. Finally,
the process $Z'$, being a stationary version of $Z$, is also an aperiodic regenerative
process with $S'_0$ being $\mathbb{Z}$-valued. Hence it satisfies the same assumptions and is
therefore also $\mathcal{T}$-trivial. ■

Corollary 2.1 Let $(Z, S)$ be an aperiodic and positive recurrent regenerative process
and let $(Z^*, S^*)$ be a stationary version of it. Let $g : \mathcal{Z}^{\infty} \to \mathbb{R}$ be such that
$E|g(Z_1^*, Z_2^*, \ldots)| < \infty$. Then

$$\frac{1}{n}\sum_{t=1}^{n} g(Z_t, Z_{t+1}, \ldots) \to E[g(Z_1^*, Z_2^*, \ldots)] \quad \text{a.s. and in } L_1. \tag{2.2}$$

Proof. Let us extend the space $(\Omega, \mathcal{F}, \mathbf{P})$ so that the statements of Theorem 2.1
hold. Then the process $Z'$ is stationary and ergodic having the same distribution as
$Z^*$. By Birkhoff's ergodic theorem then,

$$\frac{1}{n}\sum_{t=1}^{n} g(Z'_t, Z'_{t+1}, \ldots) \to E[g(Z'_1, Z'_2, \ldots)] = E[g(Z_1^*, Z_2^*, \ldots)] \quad \text{a.s. and in } L_1. \tag{2.3}$$

Since the original process $Z$ can be successfully coupled with $Z'$, it holds for almost
every realization of $Z$ and $Z'$ that they differ at the finite beginning only. Since for
a pathwise limit the beginning does not matter, we immediately get the almost sure
convergence of (2.2). The $L_1$-convergence follows from applying Scheffé's lemma
separately to $g^+(Z_t, Z_{t+1}, \ldots)$ and $g^-(Z_t, Z_{t+1}, \ldots)$. ■

Remark: If $(Z, S)$ is positive recurrent but not aperiodic, then Theorem 2.1 cannot
be applied. However, using Theorem 2.2 of [22] and noting that aperiodicity
is not used in its proof, a similar result can be obtained for shift-coupling instead
of exact coupling. The process $Z'$ can be shown to be $\mathcal{I}$-trivial and hence ergodic,
thus Corollary 2.1 still holds. In this paper we consider only aperiodic regenerative
processes.

If $f : \mathcal{Z} \to \mathbb{R}$ is measurable, then the convergence (2.2) together with (2.1) yields

$$\frac{1}{n}\sum_{t=1}^{n} f(Z_t) \to Ef(Z_1^*) = \frac{1}{ET_1} E\Big[\sum_{t=1}^{T_1} f(Z_t^0)\Big] = \frac{1}{ET_1} E\Big[\sum_{t=S_0+1}^{S_1} f(Z_t)\Big] \quad \text{a.s. and in } L_1. \tag{2.4}$$
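The cycle formula in (2.4) can be checked numerically on a toy regenerative process. The sketch below is an assumed example, not part of the paper: a two-state Markov chain on $\{0, 1\}$ regenerates at every visit to state 0, $f$ is the indicator of state 1, and the ratio of the average cycle sum to the average cycle length is compared with the known stationary probability $p_{01}/(p_{01} + p_{10})$. The function name `cycle_ratio_estimate` is hypothetical.

```python
import random

def cycle_ratio_estimate(p01, p10, n_steps, seed=0):
    """Estimate E[sum of f over one cycle] / E[cycle length] for f = 1{state = 1},
    where cycles end at visits of a two-state Markov chain to state 0."""
    rng = random.Random(seed)
    state = 0                      # start at a regeneration point (zero delay)
    cycle_sums, cycle_lens = [], []
    cur_sum, cur_len = 0, 0
    for _ in range(n_steps):
        if state == 0:
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
        cur_sum += state           # f(Z_t) = 1 if Z_t = 1, else 0
        cur_len += 1
        if state == 0:             # the chain is back at 0: the cycle closes
            cycle_sums.append(cur_sum)
            cycle_lens.append(cur_len)
            cur_sum, cur_len = 0, 0
    mean_sum = sum(cycle_sums) / len(cycle_sums)
    mean_len = sum(cycle_lens) / len(cycle_lens)
    return mean_sum / mean_len

estimate = cycle_ratio_estimate(0.1, 0.2, 200000, seed=0)
exact = 0.1 / (0.1 + 0.2)          # stationary probability of state 1
print(estimate, exact)
```

For this chain the expected cycle length is $1/\pi_0 = 1.5$ and the expected number of visits to state 1 per cycle is $0.5$, so the ratio equals $1/3$, in agreement with the time-average limit.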
2.2 Infinite Viterbi alignment

2.2.1 One-sided infinite Viterbi alignment

Def. Let for every $n$, $g^n : \mathcal{X}^n \to S^n$ be a classifier. We say that the sequence $\{g^n\}$
of classifiers can be extended to infinity, if there exists a function $g : \mathcal{X}^{\infty} \to S^{\infty}$
such that for almost every realization $x^{\infty} \in \mathcal{X}^{\infty}$ the following statement holds: for
every $k \in \mathbb{N}$ there exists $m(x^{\infty}) \ge k$ such that for every $n \ge m$ the first $k$ elements
of $g^n(x^n)$ are the same as the first $k$ elements of $g(x^{\infty})$, i.e. $g^n(x^n)_i = g(x^{\infty})_i$,
$i = 1, \ldots, k$. The function $g$ will be referred to as an infinite alignment.

Unless every observation is classified independently, the existence of an infinite
alignment is not trivial. It often happens that adding one more observation
$x_{n+1}$ changes the alignment $g^n(x^n)$. This happens often with Viterbi or
PMAP-alignments. The existence of an infinite alignment allows one to study asymptotic
properties of the alignment, which is usually done via the corresponding alignment process
$\{G_t\}_{t=1}^{\infty} := g(X)$. We consider the existence of infinite Viterbi alignments. Under
rather restrictive assumptions on HMMs the existence of an infinite Viterbi alignment
was first proved in [3]. In [18] it was proved under less restrictive assumptions.
We now introduce these assumptions and the corresponding results.

Recall that $f_s$ are the densities of $P_s := \mathbf{P}(X_1 \in \cdot \mid Y_1 = s)$ with respect to some
reference measure $\mu$ on $(\mathcal{X}, \mathcal{B})$. For each $s \in S$, let $G_s := \{x \in \mathcal{X} : f_s(x) > 0\}$. We
call a subset $C \subset S$ a cluster if the following conditions are satisfied:

$$\min_{j \in C} P_j\Big(\bigcap_{s \in C} G_s\Big) > 0 \quad \text{and} \quad \max_{j \notin C} P_j\Big(\bigcap_{s \in C} G_s\Big) = 0.$$

Hence, a cluster is a maximal subset of states such that $G_C = \cap_{s \in C} G_s$, the
intersection of the supports of the corresponding emission distributions, is 'detectable'.
Distinct clusters need not be disjoint and a cluster can consist of a single state. In
this latter case such a state is not hidden, since it is exposed by any observation
it emits. If $|S| = 2$, then $S$ is the only cluster possible, because otherwise the
underlying Markov chain would cease to be hidden. The existence of $C$ implies the
existence of a set $\mathcal{X}_o \subset \cap_{s \in C} G_s$ and $\varepsilon > 0$, $M < \infty$ such that $\mu(\mathcal{X}_o) > 0$, and
$\forall x \in \mathcal{X}_o$ the following statements hold: (i) $\varepsilon < \min_{s \in C} f_s(x)$; (ii) $\max_{s \in C} f_s(x) < M$;
(iii) $\max_{s \notin C} f_s(x) = 0$. For proof, see [18].

The following two assumptions on HMMs are needed for the existence of an infinite
Viterbi alignment.

A1 (cluster-assumption) There exists a cluster $C \subset S$ such that the sub-stochastic
matrix $R = (P(i,j))_{i,j \in C}$ is primitive, i.e. there is a positive integer $r$ such that the
$r$th power of $R$ is strictly positive.

A2 For each state $l \in S$,

(2.5)

The cluster assumption A1 is often met in practice. It is clearly satisfied if all
elements of the matrix $P$ are positive. Since any irreducible aperiodic matrix is
primitive, the assumption A1 is also satisfied if the densities $f_s$ satisfy the following
condition: for every $x \in \mathcal{X}$, $\min_{s \in S} f_s(x) > 0$, i.e. for all $s \in S$, $G_s = \mathcal{X}$. Thus,
A1 is more general than the strong mixing condition (Assumption 4.3.21 in [1]) and
also weaker than Assumption 4.3.29 in the same reference. Note that A1 implies the
aperiodicity of $Y$, but not vice versa. The assumption A2 is more technical in nature.
In [13] it was shown that for a two-state HMM, (2.5) always holds for one state, and this
is sufficient for the infinite Viterbi alignment. Hence, for the case $|S| = 2$, A2
can be relaxed. Other possibilities for relaxing A2 are discussed in [17, 18]. To
summarize: we believe that the cluster assumption A1 is essential for HMMs, while
the assumption A2, although natural and satisfied for many models, can be relaxed.
For a more general discussion of these assumptions, see [17, 18, 15, 13].
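The primitivity requirement in A1 is easy to check numerically for a given matrix. The sketch below is only an illustration: the function name `primitivity_index` is hypothetical, and the stopping rule relies on the classical Wielandt bound, by which a $k \times k$ nonnegative matrix is primitive if and only if its power of order $(k-1)^2 + 1$ is strictly positive. Only the zero pattern of the matrix matters, so boolean arithmetic is used.

```python
def primitivity_index(matrix):
    """Return the smallest r such that the r-th power of a square nonnegative
    matrix is strictly positive, or None if the matrix is not primitive."""
    k = len(matrix)
    pattern = [[matrix[i][j] > 0 for j in range(k)] for i in range(k)]
    power = pattern                              # zero pattern of matrix**1
    for r in range(1, (k - 1) ** 2 + 2):         # Wielandt bound on the exponent
        if all(all(row) for row in power):
            return r
        # Boolean matrix product: zero pattern of matrix**(r + 1).
        power = [[any(power[i][l] and pattern[l][j] for l in range(k))
                  for j in range(k)] for i in range(k)]
    return None

print(primitivity_index([[0.0, 1.0], [0.5, 0.5]]))   # second power is positive
print(primitivity_index([[0.0, 1.0], [1.0, 0.0]]))   # periodic, hence not primitive
```

To check A1 for a candidate cluster $C$, the function would be applied to the sub-matrix of transition probabilities with both indices in $C$.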
In the following, let $V^n = v^n(X^n)$, where $v^n$ is a finite Viterbi alignment. Let $U_t$
and $W_t$ be the stopping times defined as

$$W_t = \min\{\tau \ge t + r + 1 : X_{\tau - r}^{\tau} \in \mathcal{X}_o^{r+1}\}, \qquad U_t = \max\{\tau \le t - r - 1 : X_{\tau}^{\tau + r} \in \mathcal{X}_o^{r+1}\}. \tag{2.6}$$

The results of the present paper are largely based on the following theorem, which
has been proved in [18, 17]. See also Lemma 2.1 in [8].

Theorem 2.2 Let $(X, Y) = \{(X_t, Y_t)\}_{t=1}^{\infty}$ be a one-sided ergodic HMM satisfying
A1 and A2. Then there exists an infinite Viterbi alignment $v : \mathcal{X}^{\infty} \to S^{\infty}$. Moreover,
the finite Viterbi alignments $v^n : \mathcal{X}^n \to S^n$ can be chosen so that the following
conditions are satisfied:

R1 the process $Z := (X, Y, V)$, where $V := \{V_t\}_{t=1}^{\infty}$ is the alignment process, is a
positively recurrent aperiodic regenerative process with respect to some renewal
process $\{S_t\}_{t=0}^{\infty}$;

R2 there exists an integer $m \ge 0$ such that $S_0 > m$ and

1) for all $j \ge 0$ such that $S_j + m \le n$, $V_t^n = V_t$ for all $t \le S_j$,

2) $S_j - S_{j-1} > m$, $j = 1, 2, \ldots$;

R3 the renewal times $\{S_k\}$ have the following property:

1) if $S_k \ge t$, then $W_t \le S_k + m$,

2) if $S_k \le t$, then $U_t \ge S_k - m$.

Proof. The required infinite alignment is constructed piecewise, see [18]. The
regenerativity and positive recurrence are shown in Section 4 of [17]. The aperiodicity
follows from the aperiodicity of $Y$ that follows from A1. The piecewise construction
guarantees both R2 and R3. ■

From now on we assume that the finite Viterbi alignments $v^n : \mathcal{X}^n \to S^n$ are chosen
according to Theorem 2.2. These choices of alignments are called consistent. Obviously,
the consistent choice becomes an issue only if the finite Viterbi alignment
is not unique. In practice, the consistent choices can be obtained just by predefined
tie-breaking rules. With consistent choices, the process $Z^n := \{(X_t, Y_t, V_t^n)\}_{t=1}^{n}$
satisfies by R2 the following property: $Z_t^n = Z_t$ for every $t = 1, \ldots, S_{k(n)}$, where
$k(n) = \max\{k \ge 0 : S_k + m \le n\}$.

We now present a theorem that generalizes Theorem 3.1 of Chapter VI in [1]. The
proof is based on the same argument and given in Appendix. Let $p \in \mathbb{N}$ and
$g_p : \mathcal{Z}^p \to \mathbb{R}$ be measurable. Define for every $i = p, \ldots, n$

$$U_i^n := g_p(Z_{i-p+1}^n, \ldots, Z_i^n).$$

If $i \le S_{k(n)}$, then $U_i^n = U_i := g_p(Z_{i-p+1}, \ldots, Z_i)$. Finally, let

Mfc:= max + • • • + 

The random variables $M_p, M_{p+1}, \ldots$ are identically distributed, but for $p > 1$ not
necessarily independent. Recall that $Z^*$ is a stationary version of $Z$.

Theorem 2.3 Let $g_p$ be such that $EM_p < \infty$ and $E|g_p(Z_1^*, \ldots, Z_p^*)| < \infty$. Then

$$\frac{1}{n - p + 1}\sum_{i=p}^{n} U_i^n \to EU_p = Eg_p(Z_1^*, \ldots, Z_p^*) \quad \text{a.s. and in } L_1. \tag{2.7}$$

2.2.2 Double-sided infinite Viterbi alignment 

Def. Let for every $z_1, z_2 \in \mathbb{Z}$, $g_{z_1}^{z_2} : \mathcal{X}^{z_2 - z_1 + 1} \to S^{z_2 - z_1 + 1}$ be a classifier. We say that the
set $\{g_{z_1}^{z_2}\}$ of classifiers can be extended to infinity, if there exists a function
$g : \mathcal{X}^{\mathbb{Z}} \to S^{\mathbb{Z}}$ such that for almost every realization $x_{-\infty}^{\infty} \in \mathcal{X}^{\mathbb{Z}}$ the following statement holds:
for every $k \in \mathbb{N}$ there exists $m \ge k$ (depending on $x_{-\infty}^{\infty}$) such that for every $n \ge m$,

$$g_{-n}^{n}(x_{-n}^{n})_i = g(x_{-\infty}^{\infty})_i, \quad i = -k, \ldots, k.$$

The function $g$ will be referred to as an infinite double-sided alignment.

The piecewise construction of the infinite Viterbi alignment allows the double-sided
extension as well.

Theorem 2.4 Let $(X, Y) = \{(X_t, Y_t)\}_{t=-\infty}^{\infty}$ be a double-sided ergodic HMM satisfying
A1 and A2. Then there exists an infinite Viterbi alignment $v : \mathcal{X}^{\mathbb{Z}} \to S^{\mathbb{Z}}$.
Moreover, the finite Viterbi alignments $v_{z_1}^{z_2}$ can be chosen so that the following
conditions are satisfied:

RD1 the process $(X, Y, V)$, where $V := \{V_t\}_{t=-\infty}^{\infty}$ is the alignment process, is a
positively recurrent aperiodic regenerative process with respect to some renewal
process $\{S_t\}_{t=-\infty}^{\infty}$;

RD2 there exists a nonnegative integer $m < \infty$ such that

1) for every $j \ge 0$ such that $S_j + m \le n$, $V_t^n = V_t$ for all $S_0 \le t \le S_j$;

2) $S_j - S_{j-1} > m$, $j \in \mathbb{Z}$;

RD3 the renewal times $\{S_k\}$ have the following property:

1) if $S_k \ge t$, then $W_t \le S_k + m$,

2) if $S_k \le t$, then $U_t \ge S_k - m$;

RD4 the mapping $v$ is a stationary coding, i.e. $v(\theta(X)) = \theta v(X)$, where $\theta$ is a shift
operator: $\theta(\ldots, x_{-1}, x_0, x_1, \ldots) = (\ldots, x_0, x_1, x_2, \ldots)$.

Proof. The proof of RD1, RD2 and RD3 is the same as in Theorem 2.2. Note the
difference between R2 and RD2. The stationarity of $v$ follows from the fact that
the barriers in the construction of the infinite alignment are separated (Lemma 3.2
in [18]). ■

In the following, the finite Viterbi alignments $v_{z_1}^{z_2}$ are chosen to be consistent. The
property RD4 is important. Since $X$ is an ergodic process, from RD4 it follows
that the double-sided alignment process $V = \{V_t\}_{t=-\infty}^{\infty}$ as well as the process
$\{(X_t, Y_t, V_t)\}_{t=-\infty}^{\infty}$ is an ergodic process. Let $Z^*$ denote the restriction of
$\{(X_t, Y_t, V_t)\}_{t=-\infty}^{\infty}$ to the positive integers, i.e. $Z^* = \{(X_t, Y_t, V_t)\}_{t=1}^{\infty}$. By RD2,
$Z^*$ is a stationary version of $Z$ as in R1. Thus $(X_0, Y_0, V_0) \stackrel{D}{=} (X_1^*, Y_1^*, V_1^*) = Z_1^*$
and we shall often use this. Note that the one-sided Viterbi process $V$ in R1 is not
defined at time zero, so that the random variable $V_0$ always implies the double-sided,
and hence stationary case.

2.3 Smoothing probabilities 

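Let $(X, Y) = \{(X_t, Y_t)\}_{t=-\infty}^{\infty}$ be a double-sided HMM. From Levy's martingale
convergence theorem it immediately follows that for every state $j \in S$ and $z, t \in \mathbb{Z}$,
the limits of the smoothing probabilities $\mathbf{P}(Y_t = j \mid X_z^{\infty}) := \lim_n \mathbf{P}(Y_t = j \mid X_z^n)$
and $\mathbf{P}(Y_t = j \mid X_{-\infty}^{\infty}) := \lim_{z \to -\infty} \mathbf{P}(Y_t = j \mid X_z^{\infty})$ exist almost surely. In [15] it is
shown that under A1 these probabilities satisfy the following exponential forgetting
inequalities:

$$\|\mathbf{P}(Y_t \in \cdot \mid X_1^{\infty}) - \mathbf{P}(Y_t \in \cdot \mid X_{-\infty}^{\infty})\| \le C\rho^{t} \quad \text{a.s.}, \tag{2.8}$$

$$\|\mathbf{P}(Y_t \in \cdot \mid X_1^{n}) - \mathbf{P}(Y_t \in \cdot \mid X_1^{\infty})\| \le C\rho^{n - t} \quad \text{a.s.}, \tag{2.9}$$

where $C$ is a finite positive random variable, $\rho \in (0, 1)$, in the first inequality $t \ge 1$,
and in the second inequality $n \ge t \ge 1$. Here $\|\cdot\|$ stands for the total variation
distance. In what follows, we shall use the notation
$p_t(j \mid x_{-\infty}^{\infty}) := \mathbf{P}(Y_t = j \mid X_{-\infty}^{\infty} = x_{-\infty}^{\infty})$.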
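The forgetting effect behind (2.9) can be observed numerically: the smoothing distribution at a fixed time $t$ hardly changes once sufficiently many later observations are available. The sketch below is an assumed toy setting, not the construction of [15]: a two-state HMM with strictly positive transition matrix and two output symbols. The function names `smoothing_at` and `forgetting_gap` and all parameter values are hypothetical.

```python
import random

def smoothing_at(t, obs, init, trans, emit):
    """Return P(Y_t = . | X_1^n = obs) for 0-based t, by scaled forward-backward."""
    k, n = len(init), len(obs)
    alpha = [init[s] * emit[s][obs[0]] for s in range(k)]
    z = sum(alpha)
    alpha = [a / z for a in alpha]
    for u in range(1, t + 1):                      # forward pass up to time t
        alpha = [sum(alpha[i] * trans[i][j] for i in range(k)) * emit[j][obs[u]]
                 for j in range(k)]
        z = sum(alpha)
        alpha = [a / z for a in alpha]
    beta = [1.0] * k
    for u in range(n - 1, t, -1):                  # backward pass from n - 1 down to t + 1
        beta = [sum(trans[i][j] * emit[j][obs[u]] * beta[j] for j in range(k))
                for i in range(k)]
        z = sum(beta)
        beta = [b / z for b in beta]
    post = [alpha[s] * beta[s] for s in range(k)]
    z = sum(post)
    return [p / z for p in post]

def forgetting_gap(t, n_short, n_long, seed=0):
    """Total variation distance between the smoothing distributions at time t
    computed from the first n_short and from all n_long simulated observations."""
    init = [0.5, 0.5]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.8, 0.2], [0.3, 0.7]]                # emit[s][symbol]
    rng = random.Random(seed)
    y, obs = rng.randrange(2), []
    for _ in range(n_long):
        obs.append(0 if rng.random() < emit[y][0] else 1)
        y = 0 if rng.random() < trans[y][0] else 1
    short = smoothing_at(t, obs[:n_short], init, trans, emit)
    full = smoothing_at(t, obs, init, trans, emit)
    return 0.5 * sum(abs(a - b) for a, b in zip(short, full))

print(forgetting_gap(5, 50, 100))     # far from the end of the short sample
print(forgetting_gap(48, 50, 100))    # close to the end of the short sample
```

The first gap is expected to be negligible, while the second can be noticeably larger, since only one further observation separates $t$ from the end of the shorter sample.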

3 Convergence of $R_1$-risk

Let the loss function be defined as in (1.2) and let $v^n$ be a consistently chosen Viterbi
alignment. If the underlying Markov chain were not hidden, the empirical risk
of the Viterbi alignment could be directly calculated as follows:

$$R_1(Y^n, X^n) = \frac{1}{n}\sum_{t=1}^{n} l(Y_t, v_t^n(X^n)) = \frac{1}{n}\sum_{t=1}^{n} l(Y_t, V_t^n). \tag{3.1}$$

The conditional expectation of $R_1(Y^n, X^n)$ given $X^n$ is the random variable
$R_1(v, X^n) = E[R_1(Y^n, X^n) \mid X^n]$. Since $S$ is finite and $l : S \times S \to \mathbb{R}$ is bounded, from Theorem
2.3 and (2.4) it follows that

$$R_1(Y^n, X^n) \to E\,l(Y_0, V_0) = \frac{1}{ET_1} E\Big(\sum_{t=S_0+1}^{S_1} l(Y_t, V_t)\Big) =: R_1 \quad \text{a.s. and in } L_1. \tag{3.2}$$

We shall call the constant $R_1$ the asymptotic Viterbi risk. It depends only on the model
$(Y, X)$ and on the loss function $l$. For $l(s, s') = I_{\{s' \ne s\}}$, the actual risk is the average
number of mistakes made by the Viterbi alignment:

$$R_1(Y^n, X^n) = \frac{1}{n}\sum_{t=1}^{n} I_{\{Y_t \ne V_t^n\}},$$

and the corresponding asymptotic risk is the asymptotic misclassification probability
$\mathbf{P}(Y_0 \ne V_0)$.

To our knowledge, the idea of considering the $R_1$-type limits for the Viterbi alignment
was first mentioned in [2]; the convergence of the empirical risk is also stated in
[8]. To show the convergence of $R_1(v, X^n)$, we use the following lemma (see Theorem
9.4.8 in [5]).

Lemma 3.1 Let $X_n$ be bounded random variables such that $X_n \to 0$ almost surely.
Let $\{\mathcal{F}_n\}_{n=1}^{\infty}$ be a filtration. Then $E[X_n \mid \mathcal{F}_n] \to 0$ almost surely.

The following theorem is the first main result of this paper. A similar result for the
PMAP-alignment, namely the convergence of $R_1(X^n)$ to a constant, is proved in
[15].

Theorem 3.1 Let $\{(Y_t, X_t)\}_{t=1}^{\infty}$ be an ergodic HMM satisfying A1 and A2. Then
there exists a constant $R_1 \ge 0$ such that the empirical risk and the risk of the Viterbi
alignment both converge to $R_1$ almost surely and in $L_1$:

$$\lim_{n \to \infty} R_1(Y^n, X^n) = \lim_{n \to \infty} R_1(v, X^n) = R_1 \quad \text{a.s. and in } L_1.$$

Moreover, the expected risk of Viterbi alignments converges to $R_1$ as well:
$ER_1(v, X^n) \to R_1$.

Proof. The convergence of the empirical risk is (3.2). To show that $R_1(v, X^n) \to R_1$
a.s., apply Lemma 3.1 to the random variables $R_1(Y^n, X^n) - R_1$ and the filtration
generated by the observations $X^n$. Clearly, $R_1(Y^n, X^n) - R_1$ is
bounded and by (3.2) it goes to 0 a.s. Thus, by Lemma 3.1,

$$|E[R_1(Y^n, X^n) - R_1 \mid X^n]| = |E[R_1(Y^n, X^n) \mid X^n] - R_1| = |R_1(v, X^n) - R_1| \to 0 \quad \text{a.s.}$$

By Scheffé's theorem, the convergence in $L_1$ follows by the non-negativity and
boundedness of $R_1(v, X^n)$. The convergence in $L_1$ implies the convergence of expected
risks. ■
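4 Convergence of $\bar R_1$-risk

For the convergence of $\bar R_1$-risk we use Theorem 2.4. Recall that the double-sided
infinite alignment $v$ is a stationary coding. Consider the function $f : \mathcal{X}^{\mathbb{Z}} \to (-\infty, 0]$,
where

$$f(x_{-\infty}^{\infty}) := \ln p_0(v(x_{-\infty}^{\infty})_0 \mid x_{-\infty}^{\infty}) = \ln \mathbf{P}(Y_0 = V_0 \mid X_{-\infty}^{\infty} = x_{-\infty}^{\infty}).$$

In the following, let $v_i(x_{-\infty}^{\infty}) := v(x_{-\infty}^{\infty})_i$ be the $i$-th element of the infinite alignment.
Note that for every $t = 1, 2, \ldots$,

$$f(\theta^t(x_{-\infty}^{\infty})) = \ln p_0(v_0(\theta^t(x_{-\infty}^{\infty})) \mid \theta^t(x_{-\infty}^{\infty})) = \ln p_t(v_t(x_{-\infty}^{\infty}) \mid x_{-\infty}^{\infty}) = \ln \mathbf{P}(Y_t = V_t \mid X_{-\infty}^{\infty} = x_{-\infty}^{\infty}).$$

Thus, by Birkhoff's ergodic theorem, there exists a constant $\bar R_1$ such that

$$-\frac{1}{n}\sum_{t=1}^{n} \ln \mathbf{P}(Y_t = V_t \mid X_{-\infty}^{\infty}) \to -E\big(\ln \mathbf{P}(Y_0 = V_0 \mid X_{-\infty}^{\infty})\big) =: \bar R_1 \quad \text{a.s. and in } L_1, \tag{4.1}$$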

provided the expectation is finite. The main idea for proving the convergence of
$\bar R_1(v, X^n)$ is the following. Consider without loss of generality a double-sided HMM
$\{(Y_t, X_t)\}_{t=-\infty}^{\infty}$. Then by RD2, $V_t^n = V_t$ for every $S_0 \le t \le S_{k(n)}$, where
$k(n) = \max\{k \ge 0 : S_k + m \le n\}$ and $\{S_t\}_{t \ge 0}$ is the renewal process as in Theorem 2.4.
Thus,

$$-\frac{1}{n}\sum_{t=1}^{n} \ln \mathbf{P}(Y_t = V_t^n \mid X^n) = -\frac{1}{n}\sum_{t=1}^{S_0 - 1} \ln \mathbf{P}(Y_t = V_t^n \mid X^n) - \frac{1}{n}\sum_{t=S_0}^{S_{k(n)}} \ln \mathbf{P}(Y_t = V_t \mid X^n) - \frac{1}{n}\sum_{t=S_{k(n)}+1}^{n} \ln \mathbf{P}(Y_t = V_t^n \mid X^n). \tag{4.2}$$

The first term in the partition above converges to zero almost surely. We will prove
that the second term converges to $\bar R_1$ almost surely and that the third term converges
to zero almost surely. To prove the convergence of the second term, we need some
auxiliary results. Let $C$ be the cluster as in A1 and let $\mathcal{X}_o$ be the corresponding set.
The proof of the following proposition is given in Appendix.

Proposition 4.1 Let $x_{-\infty}^{\infty} \in \mathcal{X}^{\mathbb{Z}}$ be such that for some $u, v \in \mathbb{N}$, $x_{-u-r}^{-u} \in \mathcal{X}_o^{r+1}$,
$x_{v}^{v+r} \in \mathcal{X}_o^{r+1}$ and for every $s \in S$, $\lim_n p_0(s \mid x_{-n}^{n}) = p_0(s \mid x_{-\infty}^{\infty})$. Let $v_0 = v_0(x_{-\infty}^{\infty})$.
Then there exist constants $c > 0$ and $0 < B < \infty$ that are independent of data such
that

$$p_0(v_0 \mid x_{-\infty}^{\infty}) \ge c\exp[-B(u + v)]. \tag{4.3}$$

The proof of Proposition 4.1 reveals that it holds also for a finite sequence of
observations $x^n$. Moreover, the following corollary holds.

Corollary 4.1 Let $x^n \in \mathcal{X}^n$ be such that for some $w < n - r$, $x_{w}^{w+r} \in \mathcal{X}_o^{r+1}$.
Let $v_t = v_t^n(x^n)$. Then there exist $c > 0$ and $0 < D < \infty$ such that for every $t$,
$w \le t \le n$,

$$p_t(v_t \mid x^n) \ge c\exp[-D(n - w)]. \tag{4.4}$$

The proof of Corollary 4.1 follows the one of Proposition 4.1 and is sketched in
Appendix.

Lemma 4.1 There exists $\alpha > 0$ such that for every $t \in \mathbb{Z}$,

$$E\Big(\frac{1}{\mathbf{P}(Y_t = V_t \mid X_{-\infty}^{\infty})}\Big)^{\alpha} < \infty. \tag{4.5}$$

Proof. Let $W_0$ and $U_0$ be the stopping times defined in (2.6). Because for every
$s \in S$, $\lim_n \mathbf{P}(Y_0 = s \mid X_{-n}^{n}) = \mathbf{P}(Y_0 = s \mid X_{-\infty}^{\infty})$ almost surely, from (4.3) it follows
that

$$\mathbf{P}(Y_0 = V_0 \mid X_{-\infty}^{\infty}) \ge c\exp[-B(W_0 - U_0)] \quad \text{a.s.} \tag{4.6}$$

It holds that for some positive constants $a$ and $b$ and for every $k = 1, 2, \ldots$,

$$\mathbf{P}(W_0 > k) \le a\exp(-bk),$$

see, e.g. [8]. This inequality implies that for $\alpha > 0$ small enough, $E(e^{\alpha W_0}) < \infty$.
Analogously, for sufficiently small $\alpha > 0$, $E(e^{\alpha(-U_0)}) < \infty$. Thus, by the
Cauchy-Schwarz inequality it holds that for sufficiently small $\alpha$,

$$E\,e^{\alpha(W_0 - U_0)} < \infty. \tag{4.7}$$

The inequalities (4.6) and (4.7) imply (4.5) for $t = 0$. By the stationarity of $(X, Y)$,
(4.5) holds for arbitrary $t$. ■



Recall the inequalities (2.8) - (2.9). Unfortunately these bounds do not immediately
hold for the logarithms. The following lemma uses the inequality
$|\ln a - \ln b| \le \frac{|a - b|}{\min\{a, b\}}$, provided that $a, b > 0$.
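For completeness, the elementary inequality can be verified in one line; the short derivation below is added as a reading aid and assumes, without loss of generality, that $0 < b \le a$.

```latex
% Assume 0 < b \le a; the case a < b follows by swapping the roles of a and b.
\[
  0 \le \ln a - \ln b = \int_b^a \frac{dx}{x} \le \frac{a - b}{b}
  = \frac{|a - b|}{\min\{a, b\}} .
\]
```

The bound is useful exactly when both probabilities are bounded away from zero, which is why lower bounds on the smoothing probabilities are needed before (2.8) and (2.9) can be applied to logarithms.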

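Lemma 4.2 Suppose that for an $\alpha > 0$,

$$E\Big(\frac{1}{\mathbf{P}(Y_0 = V_0 \mid X_{-\infty}^{\infty})}\Big)^{\alpha} < \infty. \tag{4.8}$$

Then

$$\lim_{n \to \infty} -\frac{1}{n}\sum_{t=1}^{S_{k(n)}} \ln \mathbf{P}(Y_t = V_t \mid X^n) = \bar R_1 \quad \text{a.s.} \tag{4.9}$$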

Proof. Let 6 := P{Yt = Vt\X^^), rjf := P{Yt = 14|X"), r/t := P{Yt = Vt\Xf°) and 
let /3 = ^. Take m = n — (Inn)^. Split the sum in (|4.9p as 



— In 7?" = In 7]" In 77" = Termj + Termjj . 



n n n 

t=i t=l t=m+l 
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We will prove that Termj converges to and Termjj to zero almost surely. 

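$\mathrm{Term}_{I}$. Recall that $\{\xi_t\}$ is a stationary ergodic process. The assumption (4.8) ensures that $E|\ln\xi_0| < \infty$. Hence, by assumption,

$$\sum_{t=1}^{\infty} P(\xi_t < t^{-\beta}) = \sum_{t=1}^{\infty} P(\xi_t^{-\alpha} > t) \le E(\xi_0^{-\alpha}) + 1 < \infty.$$

Thus, the sequence $\xi_t$, $t = 1, 2, \ldots$, satisfies $P(\xi_t > t^{-\beta}\ \text{ev}) = 1$. From (2.8) it follows that $P(\eta_t > \frac{1}{2}t^{-\beta}\ \text{ev}) = 1$. Thus, almost surely $|\ln\eta_t - \ln\xi_t| < 2Ct^{\beta}\rho^{t}$ eventually. Since $-\frac{1}{n}\sum_{t=1}^{n}\ln\xi_t \to \bar{R}_1$ almost surely, we now have

$$-\frac{1}{n}\sum_{t=1}^{n}\ln\eta_t \to \bar{R}_1 \quad \text{a.s.} \tag{4.10}$$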

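Let (random) $T$ be so big that $\eta_t > \frac{1}{2}t^{-\beta}$ when $t > T$. Observe that for $n$ large enough it holds that

$$\frac{-\ln(4C) - \beta\ln n}{\ln\rho} < (\ln n)^2.$$

Therefore, for large $n$ and $t$ such that $T < t \le n - (\ln n)^2$, we have $C\rho^{n-t} < \frac{1}{4}t^{-\beta}$. By (2.9), $|\eta_t^n - \eta_t| < C\rho^{n-t}$ almost surely. Hence, for $n$ large enough and $t$ such that $T < t \le n - (\ln n)^2$, $\min\{\eta_t, \eta_t^n\} > \frac{1}{4}t^{-\beta}$ and $|\ln\eta_t^n - \ln\eta_t| < (4t^{\beta}C)\rho^{n-t}$. Thus, as $n \to \infty$,

$$\left|\frac{1}{m}\sum_{t=1}^{m}\ln\eta_t^n - \frac{1}{m}\sum_{t=1}^{m}\ln\eta_t\right| \le \frac{1}{m}\sum_{t=1}^{T}|\ln\eta_t^n - \ln\eta_t| + \frac{1}{m}\sum_{t=T+1}^{m}|\ln\eta_t^n - \ln\eta_t|$$

$$\le \frac{1}{m}\sum_{t=1}^{T}|\ln\eta_t^n - \ln\eta_t| + 4Cn^{\beta}\rho^{(\ln n)^2} \to 0 \quad \text{a.s.}$$

Since $m/n \to 1$, it follows from (4.10) that $-\frac{1}{n}\sum_{t=1}^{m}\ln\eta_t^n \to \bar{R}_1$ almost surely.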
$\mathrm{Term}_{II}$. It remains to prove that

$$-\frac{1}{n}\sum_{t=m+1}^{S_{k(n)}}\ln\eta_t^n \to 0 \quad \text{a.s.} \tag{4.11}$$

By Proposition 4.1, $P(Y_t = V_t \mid X_{-\infty}^{\infty}) \ge c\exp[-B(W_t - U_t)]$, where $U_t$ and $W_t$ are the stopping times defined as in (2.6). Observe that when $S_1 < t \le S_{k(n)}$, then according to RD3, $U_t > 0$ and $W_t \le S_{k(n)} + m \le n$. Therefore, $U_t$ and $W_t$ are $X^n$-measurable and for $S_1 < t \le S_{k(n)}$,

$$E[P(Y_t = V_t \mid X_{-\infty}^{\infty}) \mid X^n] = P(Y_t = V_t \mid X^n) \ge cE[\exp[-B(W_t - U_t)] \mid X^n] = c\exp[-B(W_t - U_t)].$$

Thus for any $k$, $P(-\ln\eta_t^n > k) \le P(B(W_t - U_t) > k + \ln c) \le a\exp[-bk]$, where the last inequality follows from [8]. Here $a$ and $b$ are positive constants. Since

$$P\left(-\frac{1}{n}\sum_{t=m+1}^{S_{k(n)}}\ln\eta_t^n > \varepsilon\right) = P\left(\sum_{t=m+1}^{S_{k(n)}} -\ln\eta_t^n > n\varepsilon\right) \le \sum_{t=m+1}^{S_{k(n)}} P\left(-\ln\eta_t^n > \frac{n\varepsilon}{(\ln n)^2}\right) \le (\ln n)^2 a\exp\left[-\frac{bn\varepsilon}{(\ln n)^2}\right]$$

and

$$\sum_{n}(\ln n)^2 a\exp\left[-\frac{bn\varepsilon}{(\ln n)^2}\right] < \infty,$$

the convergence in (4.11) follows by the Borel-Cantelli lemma. ■

We are now ready to prove the convergence of $\bar{R}_1(v, X^n)$.

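Theorem 4.1 Let $\{(Y_t, X_t)\}_{t=1}^{\infty}$ be an ergodic HMM satisfying A1 and A2. Then there exists a constant $\bar{R}_1$ such that

$$\lim_{n\to\infty}\bar{R}_1(v, X^n) = \lim_{n\to\infty} -\frac{1}{n}\sum_{t=1}^{n}\ln P(Y_t = V_t^n \mid X^n) = \bar{R}_1 \quad \text{a.s. and in } L_1.$$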

Proof. Consider the partition in (4.2). By Lemma 4.2 the second term in (4.2) converges to $\bar{R}_1$ almost surely. Thus, it suffices to prove that

$$-\frac{1}{n}\sum_{t=S_{k(n)}+1}^{n}\ln P(Y_t = V_t^n \mid X^n) \to 0 \quad \text{a.s.} \tag{4.12}$$

For every $k > 0$, let

$$M_k = \max_{S_k < n < S_{k+1}}\left|\ln P(Y_{S_k+1} = V_{S_k+1}^n \mid X^n) + \cdots + \ln P(Y_n = V_n^n \mid X^n)\right|.$$
Because of RI, for Sk < n < Sk+i and for i such that Sk < Sk + i < n, 

Therefore the random variables $M_k$ are i.i.d. As in the proof of Theorem 2.3, for (4.12) it suffices to show that $EM_k < \infty$ for every $k > 0$, because then (4.12) follows due to the Borel-Cantelli lemma. We shall consider $M_1$. The construction of $S_k$ implies that for every $k$, the observations $X_{S_k - m}, \ldots, X_{S_k - m + r}$ belong to $\mathcal{X}_0$ (see [18]). Recall that we are considering the case $n < S_2$. Hence, for every $t$ such that $S_1 < t \le n$, by Corollary 4.1,

$$|\ln P(Y_t = V_t^n \mid X^n)| \le D(n - S_1 + m) + |\ln c| \le D(S_2 - S_1 + m) + |\ln c|,$$

implying that $|M_1| \le D(S_2 - S_1 + m)^2 + (S_2 - S_1)|\ln c|$. The renewal times $S_2 - S_1$ have all moments (see [8], [17]), hence $EM_1 < \infty$. ■

Remark. Note that the approach of the present section can be easily applied to prove the convergence of the $R_1$-risk: $R_1(v, X^n) \to R_1$ a.s. Indeed, the counterpart of (4.1) is

$$\frac{1}{n}\sum_{t=1}^{n}P(Y_t = V_t \mid X_{-\infty}^{\infty}) \to E(P(Y_0 = V_0 \mid X_{-\infty}^{\infty})) =: 1 - R_1 \quad \text{a.s. and in } L_1.$$

The inequalities (2.8) and (2.9) immediately imply

$$\lim_{n}\frac{1}{n}\sum_{t=1}^{n}P(Y_t = V_t \mid X^n) = 1 - R_1 \quad \text{a.s.},$$

and since the probabilities are bounded, the convergence

$$R_1(v, X^n) = 1 - \frac{1}{n}\sum_{t=1}^{n}P(Y_t = V_t^n \mid X^n) \to R_1 \quad \text{a.s.}$$

now easily follows.

From the remark above it is clear that the difficulties with the $\bar{R}_1$-risk are due to the unboundedness of $\ln P(Y_t = V_t^n \mid X^n)$, since in principle $P(Y_t = V_t^n \mid X^n)$ can be arbitrarily small. However, this is not so when the PMAP-alignment is used instead of the Viterbi alignment. Then $\max_s P(Y_t = s \mid X^n) \ge |S|^{-1}$. By Birkhoff's theorem,

$$-\frac{1}{n}\sum_{t=1}^{n}\max_{s}\ln P(Y_t = s \mid X_{-\infty}^{\infty}) \to \bar{R}_1^{*} \quad \text{a.s. and in } L_1, \tag{4.13}$$

where $\bar{R}_1^{*}$ is a constant. The inequalities (2.8) and (2.9) imply that

$$\left|\max_{s}\ln P(Y_t = s \mid X^n) - \max_{s}\ln P(Y_t = s \mid X_{-\infty}^{\infty})\right| \le C|S|(\rho^{t} + \rho^{n-t}) \quad \text{a.s.}$$

Thus, the convergence (4.13) implies the convergence

$$\bar{R}_1(X^n) = -\frac{1}{n}\sum_{t=1}^{n}\max_{s \in S}\ln P(Y_t = s \mid X^n) \to \bar{R}_1^{*} \quad \text{a.s. and in } L_1. \tag{4.14}$$

Hence, the following corollary holds.

Corollary 4.2 There exists a constant $\bar{R}_1^{*}$ such that (4.14) holds.
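To make the risks of this section concrete, the following minimal sketch simulates a two-state HMM with discrete emissions and evaluates the empirical $R_1$-risk and logarithmic risk of the Viterbi alignment, together with the logarithmic risk of the PMAP-alignment. The transition matrix, emission probabilities, sample size and all function names are illustrative assumptions, not taken from the paper; the smoothing probabilities are computed with a standard scaled forward-backward recursion.

```python
import math
import random

P = [[0.9, 0.1], [0.2, 0.8]]   # transition matrix (illustrative values)
PI = [2 / 3, 1 / 3]            # its stationary distribution
F = [[0.8, 0.2], [0.3, 0.7]]   # F[s][x]: emission probabilities on {0, 1}
S = range(2)


def simulate(n, rng):
    """Draw n observations from the stationary HMM."""
    y = rng.choices(S, weights=PI)[0]
    xs = []
    for _ in range(n):
        xs.append(rng.choices([0, 1], weights=F[y])[0])
        y = rng.choices(S, weights=P[y])[0]
    return xs


def posteriors(xs):
    """Smoothing probabilities P(Y_t = s | x^n) via scaled forward-backward."""
    n = len(xs)
    alpha, scale, prev = [], [], None
    for t, x in enumerate(xs):
        if t == 0:
            a = [PI[s] * F[s][x] for s in S]
        else:
            a = [sum(prev[i] * P[i][s] for i in S) * F[s][x] for s in S]
        c = sum(a)
        prev = [v / c for v in a]
        alpha.append(prev)
        scale.append(c)
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(P[s][j] * F[j][xs[t + 1]] * beta[t + 1][j] for j in S)
                   / scale[t + 1] for s in S]
    return [[alpha[t][s] * beta[t][s] for s in S] for t in range(n)]


def viterbi(xs):
    """Maximum likelihood path, computed in the log domain."""
    delta = [math.log(PI[s] * F[s][xs[0]]) for s in S]
    back = []
    for x in xs[1:]:
        ptr, new = [], []
        for s in S:
            i = max(S, key=lambda i: delta[i] + math.log(P[i][s]))
            ptr.append(i)
            new.append(delta[i] + math.log(P[i][s]) + math.log(F[s][x]))
        back.append(ptr)
        delta = new
    path = [max(S, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


rng = random.Random(1)          # arbitrary seed for reproducibility
xs = simulate(2000, rng)
post, v, n = posteriors(xs), viterbi(xs), len(xs)

r1 = 1 - sum(post[t][v[t]] for t in range(n)) / n                 # R_1 risk of Viterbi
r1_bar = -sum(math.log(post[t][v[t]]) for t in range(n)) / n      # log risk of Viterbi
r1_bar_pmap = -sum(math.log(max(post[t])) for t in range(n)) / n  # log risk of PMAP
print(r1, r1_bar, r1_bar_pmap)
```

In this sketch the PMAP value can never exceed the Viterbi value, and it is at most $\ln|S|$, which reflects the boundedness that makes the PMAP case easy.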

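5 Convergence of $\bar{R}_\infty$-risk

Recall that $\bar{R}_\infty(X^n) = -\frac{1}{n}\ln P(Y^n = V^n \mid X^n)$ and $V^n = v^n(X^n)$. Let $p(x^n)$ be the likelihood of $x^n$ and let $p(x^n \mid s^n)$ denote the conditional likelihood of observing $x^n$ given that $\{Y^n = s^n\}$. Note that $\ln p(x^n \mid s^n)$ can be expressed as

$$\ln p(x^n \mid s^n) = \sum_{t=1}^{n}\ln f_{s_t}(x_t) = \sum_{t=1}^{n}\ln f_1(x_t)I_1(s_t) + \cdots + \sum_{t=1}^{n}\ln f_{|S|}(x_t)I_{|S|}(s_t). \tag{5.1}$$

To prove the convergence of $\bar{R}_\infty(X^n)$, write $P(Y^n = V^n \mid X^n)$ as

$$P(Y^n = V^n \mid X^n) = \frac{p(X^n \mid V^n)P(Y^n = V^n)}{p(X^n)}.$$

Then

$$\bar{R}_\infty(X^n) = -\frac{1}{n}\left[\ln p(X^n \mid V^n) + \ln P(Y^n = V^n) - \ln p(X^n)\right]. \tag{5.2}$$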
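The three-term decomposition of the risk can be checked on a toy example. The sketch below uses a hypothetical two-state HMM with binary emissions (all numerical values and function names are assumptions made for illustration), finds the Viterbi path by brute-force enumeration of a short sequence, and compares the sum of the three terms with a direct evaluation of $-\frac{1}{n}\ln P(Y^n = V^n \mid X^n)$.

```python
import itertools
import math

P = [[0.9, 0.1], [0.2, 0.8]]   # transition matrix (illustrative values)
PI = [2 / 3, 1 / 3]            # its stationary distribution
F = [[0.8, 0.2], [0.3, 0.7]]   # F[s][x]: emission probabilities on {0, 1}
S = (0, 1)


def log_prior(path):
    """ln P(Y^n = path), using the Markov property."""
    return math.log(PI[path[0]]) + sum(
        math.log(P[i][j]) for i, j in zip(path, path[1:]))


def log_emission(xs, path):
    """ln p(x^n | path): the sum of ln f_{s_t}(x_t)."""
    return sum(math.log(F[s][x]) for s, x in zip(path, xs))


def log_likelihood(xs):
    """ln p(x^n), computed by the forward recursion."""
    alpha = [PI[s] * F[s][xs[0]] for s in S]
    for x in xs[1:]:
        alpha = [sum(alpha[i] * P[i][s] for i in S) * F[s][x] for s in S]
    return math.log(sum(alpha))


xs = [0, 0, 1, 1, 1, 0, 1, 0]
n = len(xs)
paths = list(itertools.product(S, repeat=n))

# Brute-force Viterbi path: the maximizer of the joint likelihood.
viterbi_path = max(paths, key=lambda y: log_prior(y) + log_emission(xs, y))

terms = (log_emission(xs, viterbi_path), log_prior(viterbi_path), -log_likelihood(xs))
r_inf = -sum(terms) / n                            # three-term expression

joint = [math.exp(log_prior(y) + log_emission(xs, y)) for y in paths]
direct = -math.log(max(joint) / sum(joint)) / n    # -(1/n) ln P(Y^n = V^n | X^n)
print(abs(r_inf - direct) < 1e-12)
```

Enumeration is only feasible for very short sequences; it is used here because it makes the comparison independent of any particular dynamic programming implementation.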

Before stating the theorem about the convergence of $\bar{R}_\infty(X^n)$, we introduce the conditional measure $Q_s := P(X_0 \in \cdot \mid V_0 = s)$, $s \in S$. As follows from Theorem 2.3, the measure $Q_s$ is the almost sure limit of the empirical measure corresponding to the Viterbi alignment state $s$, i.e. for every Borel set $A$,

$$\frac{\sum_{t=1}^{n}I_{A\times\{s\}}(X_t, V_t^n)}{\sum_{t=1}^{n}I_s(V_t^n)} \to Q_s(A) \quad \text{a.s.}$$

This convergence is the basis of the adjusted Viterbi training introduced in [16, 17]. Note that for every $Q_s$-integrable $g$,

$$E(g(X_0)I_s(V_0)) = E(g(X_0) \mid V_0 = s)P(V_0 = s) = m_s\int g(x)Q_s(dx), \tag{5.3}$$

where $m_s := P(V_0 = s)$.

Theorem 5.1 Let for every $s \in S$ the logarithm of the conditional density $f_s$ be $P_s$-integrable. Then

$$-\bar{R}_\infty(X^n) \to \sum_{s \in S}m_s\int\ln f_s(x)Q_s(dx) + E[\ln p_{V_1V_2}] + H_X =: -\bar{R}_\infty \quad \text{a.s. and in } L_1,$$

where $H_X$ is the entropy rate of $X$ and $p_{ij} = P(Y_2 = j \mid Y_1 = i)$.



Proof. Consider (5.2). To prove the convergence of the first term of the RHS, apply (5.1) to the Viterbi alignment. In [11] it was shown that if $\ln f_s$ is $P_s$-integrable, then $\ln f_s$ is also $Q_s$-integrable for every $s$. Then by Theorem 2.3 and (5.3), for every state $s \in S$,

$$\frac{1}{n}\sum_{t=1}^{n}\ln f_s(X_t)I_s(V_t^n) \to E(\ln f_s(X_0)I_s(V_0)) = m_s\int\ln f_s(x)Q_s(dx) \quad \text{a.s. and in } L_1.$$

This together with (5.1) gives

$$\frac{1}{n}\ln p(X^n \mid V^n) \to \sum_{s \in S}m_s\int\ln f_s(x)Q_s(dx) \quad \text{a.s. and in } L_1.$$

For the second term use the Markov property

$$\ln P(Y^n = V^n) = \ln\pi_{V_1^n} + \ln p_{V_1^nV_2^n} + \cdots + \ln p_{V_{n-1}^nV_n^n},$$

where $\pi_s = P(Y_1 = s)$. Since $V^n$ is a path with positive likelihood, $p_{V_t^nV_{t+1}^n} > 0$ almost surely for every $t$. Because the number of states is finite, there exists a constant $M > 0$ such that for every $t$, $-\ln p_{V_t^nV_{t+1}^n} < M$ almost surely. Hence the assumptions of Theorem 2.3 hold and, with $p_{V_0^nV_1^n} = \pi_{V_1^n}$, we get

$$\frac{1}{n}\ln P(Y^n = V^n) = \frac{1}{n}\sum_{t=0}^{n-1}\ln p_{V_t^nV_{t+1}^n} \to E[\ln p_{V_1V_2}] \quad \text{a.s. and in } L_1,$$

where $E[\ln p_{V_1V_2}] = \sum_{i,j \in S}\ln p_{ij}P(V_1 = i, V_2 = j)$. Finally, the Shannon-McMillan-Breiman theorem implies the convergence of the third term of the RHS in (5.2):

$$-\frac{1}{n}\ln p(X^n) \to H_X \quad \text{a.s. and in } L_1.$$
Remark. Note that $-E[\ln p_{Y_1Y_2}]$ is the entropy rate of $Y$. By the same argument,

$$\frac{1}{n}\ln P(Y^n \mid X^n) \to \sum_{s \in S}\pi_s\int\ln f_s(x)P_s(dx) - H_Y + H_X =: -\bar{R}_\infty^{Y} \quad \text{a.s. and in } L_1,$$

where $H_Y$ is the entropy rate of $Y$. The convergence in $L_1$ implies

$$-\frac{1}{n}E[\ln P(Y^n \mid X^n)] \to \bar{R}_\infty^{Y},$$

where the expectation is taken over $X^n$ and $Y^n$. Since $-E[\ln P(Y^n \mid X^n)] = H(Y^n \mid X^n)$ (the conditional entropy of $Y^n$ given $X^n$), the limit could be interpreted as the conditional entropy rate of $Y$ given $X$; it is not the entropy rate of $Y$. Clearly, $\bar{R}_\infty \le \bar{R}_\infty^{Y}$, and the difference of those two numbers shows how much the Viterbi alignment "overestimates" the likelihood.



A Proofs of Theorem 2.3, Proposition 4.1 and Corollary 4.1

A.1 Proof of Theorem 2.3

Proof. Partition the sum in (2.7) as



i=p i=p i=S'fe(„)+l 

Since $S_{k(n)} \to \infty$ almost surely, from (2.2) we know that



n — p + 



•S'fe(n) 



s, 



r— y^Ui-^ Egp{Zl, ... ,Z*) a.s. and in Li. 



(A.l) 



Since $ET_1 < \infty$ and $n > p$, by the SLLN and the elementary renewal theorem

$$\frac{S_{k(n)}}{n - p + 1} = \frac{S_{k(n)}}{k(n)}\cdot\frac{k(n)}{n - p + 1} \to 1 \quad \text{a.s. and in } L_1.$$



Combining this with (A.1) and taking into account that the sequence { } is
bounded, we obtain that



n — p + 



— — '^Ui^ Egp{Zl, ...,Z*) a.s. and inLi. 



i=p 



Note that 



n — p + 



T. E 0? 

i—Sk{n)+^ 



< 



Ml 



k(n) 



< 



Ml 



k(n) 



5'fc(n) + l-p k{n)-p+l' 
Since the random variables $M_k$, $k \ge p$, are identically distributed, it holds for every $\varepsilon > 0$ that



k=p k=p 

Thus, by the Borel-Cantelli lemma $M_k/k \to 0$ almost surely as $k \to \infty$. Clearly, $E(M_k/k) \to 0$, so by Scheffé's theorem $M_k/k \to 0$ in $L_1$ as well.



A.2 Preliminaries for proving Proposition 4.1 and Corollary 4.1

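Let us start with some notation. For every sequence of observations $x_k^l = (x_k, \ldots, x_l) \in \mathcal{X}^{l-k+1}$, for every sequence of states $y_k^l = (y_k, \ldots, y_l) \in S^{l-k+1}$ and states $i, j \in S$, we denote by $p(x_k^l, y_k^l, j \mid i)$ the following conditional likelihood:

$$p(x_k^l, y_k^l, j \mid i) := P(i, y_k)\prod_{u=k}^{l-1}P(y_u, y_{u+1})\prod_{u=k}^{l}f_{y_u}(x_u)\,P(y_l, j).$$

Similarly,

$$p(x_k^l, y_k^l \mid i) := \sum_{j}p(x_k^l, y_k^l, j \mid i), \qquad p(x_k^l, y_k^l) := \sum_{i}p(x_k^l, y_k^l \mid i)\pi(i).$$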

j i 

We also define 

a{xi,s):= Pixi,yi), P{xi\i)= Yl P(^lyk\i)- 

y[eS''-'+'^:yi=s y[eS''-'+'^ 

The last two notations are standard in the HMM literature, see e.g. [313]. Let 

/3{x[,s\i)= Y Pixi,yi\i), a{s,xi):= ^ P(4>yi)- 

y[&S''-'+^:yi=s yi€S''-'+^:yk=s 

Finally, let

$$\sigma(x_k^l, j \mid i) := \max_{y_k^l}p(x_k^l, y_k^l, j \mid i), \qquad \sigma(x_k^l \mid i) := \max_{y_k^l}p(x_k^l, y_k^l \mid i).$$

Let $C$ be the cluster as in A1. Thus, there is an $r \ge 1$ such that the matrix $R^r$ has positive entries. Let $\mathcal{X}_0$ be the corresponding set. Suppose $x^r \in \mathcal{X}_0^r$ and $y^r \in C^r$. By the definition of $\mathcal{X}_0$, it holds that



u=l 



By the cluster assumption, < imuijfzc R!" {i, j) < yi)P(yi, 7/2) • • -Piyr-iJ)) < 
1, provided i,j S C. Hence there exist constants < a < A < 00, not depending on 
the observations, such that 



a < pi^x"" ,y''\i) < A and a < p{x'' ^,y'' ^,j\i)<A, j e C. (A. 2) 






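Suppose now $x^m$, $m > r$, is a sequence of observations such that the first $r$ elements belong to the set $\mathcal{X}_0$, i.e. $x^r \in \mathcal{X}_0^r$. Then for every $i$, $p(x^r, y^r \mid i) > 0$ only if $y^r \in C^r$, implying that

$$\sigma(x^m, j \mid i) = \max_{s \in C}\ \max_{y^r \in C^r:\, y_r = s}p(x^r, y^r \mid i)\,\sigma(x_{r+1}^m, j \mid s).$$

Let now $i_1, i_2 \in C$. Then for some states $s_1, s_2 \in C$,

$$\sigma(x^m, j \mid i_1) = \max_{y^r \in C^r:\, y_r = s_1}p(x^r, y^r \mid i_1)\,\sigma(x_{r+1}^m, j \mid s_1),$$

$$\sigma(x^m, j \mid i_2) = \max_{y^r \in C^r:\, y_r = s_2}p(x^r, y^r \mid i_2)\,\sigma(x_{r+1}^m, j \mid s_2) \ge \max_{y^r \in C^r:\, y_r = s_1}p(x^r, y^r \mid i_2)\,\sigma(x_{r+1}^m, j \mid s_1).$$

Hence, the inequalities (A.2) imply that for every state $j$

$$\frac{\sigma(x^m, j \mid i_1)}{\sigma(x^m, j \mid i_2)} \le \frac{\max_{y^r \in C^r:\, y_r = s_1}p(x^r, y^r \mid i_1)}{\max_{y^r \in C^r:\, y_r = s_1}p(x^r, y^r \mid i_2)} \le \frac{A}{a}. \tag{A.3}$$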

Similarly, if x™ is such that the last r elements belong to Xq, i.e. x^_j._^_i € Af^, 
then for arbitrary states ji,j2 € C there exist si,S2 € C such that 

C7(x'",ji|«) = max 

y -ym— 

cj(x'",j2|«) = max 

y'"~'' + ^:ym-r + l=S2 

> max 

y-m-r + l .y^_^^^—g^ 

So from ()A.2p it follows that 

cr(x"^,j2K) ~ 0-(a;m_,.+2,i2|si) ~ a' 

Proof of Proposition 4.1

Proof. Let x^^^^ be a sequence of observations and let x!i„ be its subword. For every 
state i £ S, we are interested in probability pQ(i\x'^^) := P(lo = A^-n — ^-n)- 
Note that 

y'l„-yo=i 

Observe that for every u,v € {1, . . . ,n — 1} and for an arbitrary state, let it be 1, 

^o{xl^,i)= YYYl Y 0({xZn,si)f3ixZl+i,S2\si)P{s2,l)fi{xo)P{x1-\s3\l)P{s3,Si)a{s4, 

siG5s2G5s3GSs4eS 

X^ ^,S4 

; Xv ) 

siG5 S4G5 

>p(x:;:)(mina(x:i+i,l|.))/i(xo)(mina(xr\.|l))p(x-). 



pix""-''-^ 




-r+l 


|i)o-(x;;^_,,+2,ii|si), 


pix""-'^ 




-r+l 


\i)a{x^_,.^2^h\s2) 


p(x™-"^ 




-r+l 


\i)(y{xZ-r+2^h\s\)- 
Without loss of generality assume vo{x'^^) = 1. Let V-u{x'^^) = a and Vy{x'^^) = 
b. By Bellman's optimality principle, for every io S 

o-(xli_^i,l|a)/i(xo)cr(xi~\6|l) > a{xZl+i,io\a)fi,{xo)a{xl~^,b\io), 

implying that for every state ig, 

, ^ <r{xZl+i,io\a) ^ , a{x^-\b\io) 
fi[xo) > — ^1 T--fio[xo) ■ 

Thus, 

/ n IN ^ ^ _^ (min,t7(xl^^i,l|s)) • I Nf / w ^-1 Al - ^ (min,(7(x^-^g|l)) „ 

cj(j;_„+i,l|a) a{x^ ,b\l) 

(A.5) 

Note that for every x^, 
Therefore, for every io ^ S 

-u+U S2|si)-P('52; ^o)/io(2;o)/3(2;i , ■53^0)^(53; S4)o('54; 2;") 

siG5 S2&S S3G5 S4G5 

^ E E «(^=n,^l)|5|"~''^(^:l+l,^o|5l)/.„(xo)|5r-V(xrSs4|io)a(54,x:;) 

<p(xi::)|Sri(maxa(x:i+i,io|.))/i„(xo)|5ri(maxa(xr\.M)p(x^). 

Let be such that G ;fX+i and G X^+^ . Then a(2;i;^,si) = if 

si C, since x.^j G Xq. Analogously, 0(54, x") = if S4 ^ C. Thus, in this case the 
inequality above becomes 

7o(2;"„,«o) <p(xi;^)|Sr-^(maxa(x:i+i,io|s))/i„(xo)|5r"^(max(7(x^-\s|zo))p(x[:). 
The same holds for (jA.Sp . implying that 

7o(2;"„,l) ^min,ecf^(a;li+i,l|s) (y{xZl+i,io\a) 



7o(a^-n;^o) cr(x_i+;^, l|a) max^gc io|s) 

^ o-(xi~\ b|io) min^gc cr(xi~\ g|l) |^|2-(^+^,) 
maXsgccr(x^"\s|io) cj(x'j'"\ 

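The inequalities (A.3) and (A.4) imply that the ratios above are bounded below by $a/A$, which does not depend on the observations. Thus, there exist constants $c_1$ and $0 < B < \infty$ (not depending on the data) such that for every state $i_0$,

$$\frac{p_0(1 \mid x_{-n}^{n})}{p_0(i_0 \mid x_{-n}^{n})} = \frac{\gamma_0(x_{-n}^{n}, 1)}{\gamma_0(x_{-n}^{n}, i_0)} \ge c_1\exp[-B(u + v)]. \tag{A.6}$$

Since $\sum_{i \in S}p_0(i \mid x_{-n}^{n}) = 1$, there exists $i_0$ such that $p_0(i_0 \mid x_{-n}^{n}) \ge |S|^{-1}$. Thus, by (A.6),

$$p_0(1 \mid x_{-n}^{n}) \ge \frac{c_1}{|S|}\exp[-B(u + v)].$$

Because $p_0(1 \mid x_{-n}^{n}) \to p_0(1 \mid x_{-\infty}^{\infty})$, inequality (4.3) follows by taking $c = c_1/|S|$. ■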




Proof of Corollary 4.1

Proof. The proof is analogous to the proof of Proposition 4.1. Using the same notation we obtain that for every $t$, $w < t \le n$,



Jtix", vt) > pix"") ( min a{xl^^^,vt\s)) fy, {xt)a{x];^^^ \vt). 

s€0 



For every Iq G S, 



7^(x^^o)<p(:r-)(maxcT(xL-+\,io|s))/^„(xt)a(x^+l|io)|Sr-'"-^ 
Let Vyoix"") = b. By BeUman's optimahty principle, 

Thus, 

VtiytV"-) ^ -ft{x'',vt) ^ min,scf7(x^"^\,t)f|s) a{x^~^^,io\b) 
Pt(io|x") -n{x"',io) ~ a{x^~^^,'dt \b) max^gc ^o|s) 

Because the ratios above are bounded below by $a/A$ and $p_t(i_0 \mid x^n) \ge |S|^{-1}$ for some $i_0 \in S$, the statement of the corollary follows with $D = \ln|S|$. ■